\begin{savequote}[8cm]
Essentially, all models are wrong, but some of them are useful.
\qauthor{George E. P. Box}
  
\end{savequote}


%\chapter{A spatial model of network growth}
\chapter{Modelling social network growth over space}
\chaptermark{Modelling network growth}
\label{ch:model}

Connections established between users of online social networks are often
influenced by basic social mechanisms such as preferential attachment, which
captures how popular users attract more and more new links, and triadic closure,
which models how shared friends are powerful indicators of future
friendships of a user. However, we have seen that geographic distance
is also a factor which impacts online social links; our results
indicate that spatial proximity fosters the creation of online social
ties. Yet, the underlying spatial processes that might shape the creation of 
new social links are still largely unknown.

Different evolutionary models have been proposed and tested to explain  the
growth of online social networks; in many cases they describe the
behaviour of individual nodes that results in some global structural properties
observed in real-world systems, such as power-law degree distributions and high
clustering coefficients~\cite{LBKT08:microscopic}.  The fundamental importance of
such models is due to the fact that they often highlight universal
characteristics of user behaviour; for instance,  mechanisms such as preferential
attachment and triadic closure are thought to mimic the actions of individuals
creating their social connections, thus offering practical insights to predict
future links~\cite{LK07:prediction, LLC10:prediction}.  However, these
growth models often neglect factors that are not inherently connected to the
structure of the social network itself; in particular, they neglect spatial
distance.

Among the many interesting questions arising from the inclusion of spatial
information in social network analysis, we aim to understand how
distance is affecting the establishment  of new online social ties. In the previous
chapter we have seen that by considering only social or only spatial
factors one cannot reproduce the overall properties of real spatial social
networks. The open question is therefore how to merge space and social
influences successfully to reproduce what is observed in real graphs.  In this chapter we
answer this question: we present a detailed study of the temporal evolution of a social network by
quantifying the impact of spatial and social factors on the growth of a popular location-based
social service. We exploit our findings to propose a new model of
network growth, which is able to describe the social and spatial properties observed in real
networks.

\paragraph{Chapter outline}
In this chapter we study the temporal evolution of a 
social network using daily snapshots of a popular location-based
social service, with information  about users' location and social connections.
We discuss how the network grows over time in
Section~\ref{sec:temporal_dataset}.  Using this fine-grained temporal
information about network evolution, in Section~\ref{sec:mle} we test 
whether different edge attachment and triadic closure models can explain
the observed data, adopting an approach based on \textit{likelihood estimation}.
This methodology allows us to compare quantitatively different evolution models
according to their statistical ability to reproduce real events.
Section~\ref{sec:temporal_evolution} studies the temporal patterns of
individual user behaviour, namely the lifetime of 
a node, that is the amount of time a user is actively creating new edges, and 
the inter-edge waiting time, which governs the amount of time elapsed before a node
will create a new edge.

Based on these findings, in Section~\ref{sec:gravitational_model} we
  describe a new model of network growth which is able to reproduce both the
    social and spatial properties observed in the real data. 
Our results show that geographic constraints
should be considered when dealing with online social networks and suggest
that a gravitational attachment model is able to capture the effect
of geographic distance on users. We consider the implications of our study in
Section~\ref{sec:disc4}, reviewing related work in Section~\ref{sec:related4}.
Section~\ref{sec:concl4} closes the chapter.

%We are able to show that that geographic distance plays an important role in the
%creation of new social connections:  node degree and spatial distance
%simultaneously shape user connections in a \textit{gravitational attachment}
%process that mimics the attraction forces between physical bodies, influenced by
%mass and distance. At the same time, triadic closure is still shaping the
%creation of social links, although this process appears to be driven purely by
%social factors.  We propose and evaluate  evaluate a new model of network growth
%which is able to describe the social and spatial properties observed in real
%networks, confirming our findings.

\section{Measuring network growth}
\label{sec:temporal_dataset}
In this section we describe the temporal evolution over 4 months of a popular
location-based social service, Gowalla. We have acquired traces
which include fine-grained temporal information about when individual social
connections between users were created; this enables us to explore the temporal
evolution of  the Gowalla social network.

\subsection{Data collection}
\label{subsec:temporal_gowalla_data}
We have downloaded daily snapshots of Gowalla data between May and
August 2010 by accessing their public API, obtaining for each snapshot the
data outlined in Section~\ref{sec:methodology}.
%Users are identified by consecutive numeric IDs, hence we were able to exhaustively gather
%information about all registered user accounts on the service each
%day, downloading details regarding their profiles, their friends and
%their check-ins.  We also acquired the geographic coordinates of
%each place available on the service, as uploaded by Gowalla users.  
%
This dataset represents a sequence of \textit{complete} snapshots of a
large-scale location-based service, allowing us  to study how the
social network grows over time and over space. In particular, we have
temporal information about all the social links created during our measurement
process; this enables us to study the social and spatial factors that may influence
the creation of social links.
%Note that even though we have
%no temporal information about social connections created before our measurement
%period, we can still estimate when each user joined the
%network by observing the list of previous check-ins.

The dataset contains about 400,000 registered users at the end of our measurement period, but
only a fraction of them are actively using Gowalla, while many of these accounts
do not show any type of activity: no social connections and no check-ins. There
are 183,709 users with at least one check-in and 162,239 with at least
one friend. We focus our analysis on 122,030 active users who have both friends
and check-ins. 
\begin{figure}[t]
  \centering
  \subfigure[Nodes]{
    \label{fig:gowalla_node_growth}
     \includegraphics[width=\doubleplotscale]{images/gowalla_growth_nodes.pdf}}
 \subfigure[Links]{
    \label{fig:gowalla_edge_growth}
     \includegraphics[width=\doubleplotscale]{images/gowalla_growth_links.pdf}}
    \caption{Temporal evolution of the number of nodes (a) and links (b) 
            in the spatial social network of Gowalla users, as
            captured by our periodic measurements.}
    \label{fig:gowalla_growth}
\end{figure}


\subsubsection{Notation}
Formally, we represent the spatial social network of Gowalla users as an
undirected graph. We denote by \emt{N} and \emt{K} the total numbers of nodes and
edges, while \emt{G_t = (N_t, K_t)} is the graph composed of the earliest \emt{t} edges
\emt{(e_1, \ldots, e_t)}, with \emt{G_T} being the final network at the end of the measurement
process. The time when edge \emt{e} was created is \emt{t(e)} and \emt{t(u)}
is the time when node \emt{u} joined the service, that is, the time when it
created its first connection or made its first check-in.
The age of node \emt{u} at time \emt{t} is denoted as \emt{a_u(t) = t-t(u)}.
The degree of node \emt{u} at time \emt{t} is \emt{k_u(t)},
while the number of nodes with degree \emt{k} at time \emt{t} is denoted as
\emt{n_k(t)}.

%Every node of this network is embedded in a metric space: in our case, the
%metric space is the 2-dimensional surface of the planet and we adopt the
%great-circle distance over the Earth as distance metric.  We define the location
%of each user as the geographic location of the place where he/she has more
%check-ins overall at the end of the measurement period. We denote by
%\emt{D_{ij}} the distance between nodes \emt{i} and \emt{j}.  We assign a length
%\emt{l_{ij}} to each social link so that \emt{l_{ij} = D_{ij}}: since node
%positions do not change with time, link lengths and distances between nodes do
%not change either.

As in the previous chapter, every node of this network is embedded in a metric
space: we assign each user to the geographic location of the place where he/she
has made the greatest number of his/her check-ins at the end of the measurement
period. Since we do not change user locations over time, link lengths and
distances between nodes do not change either.


%\begin{table}[t]
%\centering
%%\begin{tabular}{|c|c|}
%%\hline
%%\emt{N} & 122,030 \\
%%\emt{K} & 577,014 \\
%%\emt{\langle k \rangle} & 9.28\\
%%\emt{\langle C \rangle} & 0.254 \\
%%\emt{N_{GC}} & 116,910 (95.8\%)\\
%%\emt{D_{EFF}} & 5.43 \\
%%\emt{\langle l \rangle} & 1,792 km \\
%%\emt{\langle D \rangle} & 5,479 km \\
%%\hline
%%\end{tabular}
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
%\hline
%\emt{N} &  \emt{K} &  \emt{\langle k \rangle} &  \emt{\langle C \rangle}  & \emt{N_{GC}} &  \emt{H_{EFF}}  & \emt{\langle l \rangle} &  \emt{\langle D \rangle} \\
%\hline
%122,030  & 577,014  & 9.28 & 0.254  & 116,910 (95.8\%) & 5.43  & 1,792 km  &
%5,663 km  \\
%\hline
%\end{tabular}
%\caption{Properties of the spatial social network at the end of the measurement
%    period: number of nodes \emt{N},
%    number of edges \emt{K}, average node degree \emt{\langle k \rangle}, average
%        clustering coefficient \emt{\langle C \rangle}, number of nodes in the
%        largest connected component \emt{N_{GC}}, 
%   90th percentile of shortest path length distribution \emt{H_{EFF}}, 
%    average geographic distance between nodes \emt{\langle D \rangle}, average link
%    length \emt{\langle l \rangle}.
%}
%\label{tab:gowalla_temporal_network_properties}
%\end{table}


\subsection{Basic properties}
The number of nodes and the number of links
grow approximately linearly over time, with the number of links growing at a
faster pace. On average, the network gains about 375 new nodes and about 1,900 new edges per
day, as shown in Figure~\ref{fig:gowalla_growth}.
%The main properties of the spatial social network at the end of the measurement
%period are also reported in Table~\ref{tab:gowalla_temporal_network_properties}.
Note that the graph at the end of our measurement period corresponds to the
Gowalla dataset already studied in Chapter~\ref{ch:structure}.

%Hence, the average degree of the social network slowly increases: as depicted in
%Figure~\ref{fig:gowalla_densification}, we
%find that a relationship  \emt{K_t \propto N_t^\rho} holds with an exponent
%\emt{\rho \approx 1.11}, which denotes superlinear growth of the edges with respect
%to the number of nodes, a sign of densification of the network as time
%passes by. 

\begin{figure*}[t]
  \centering
  \subfigure[Node degree]{
    \label{fig:temporal_gowalla_degree_dist}
     \includegraphics[width=\tripleplotscale]{images/temporal_gowalla_degree_dist.pdf}}
 \subfigure[Node age]{
    \label{fig:temporal_gowalla_age_dist}
     \includegraphics[width=\tripleplotscale]{images/temporal_gowalla_age_dist.pdf}}
 \subfigure[Link length]{
    \label{fig:temporal_gowalla_distance_dist}
     \includegraphics[width=\tripleplotscale]{images/temporal_gowalla_link_dist.pdf}}
    \caption{Complementary Cumulative Distribution Function (CCDF) of node
      degree (a), node age (b), and link geographic length (c) at the end of the measurement period.}
    \label{fig:temporal_gowalla_dists}
\end{figure*}

%At the end of the measurement period the spatial social network under analysis
%contains 122,030 nodes and 577,014 edges, with an average degree of 9.28 and an
%average clustering coefficient of 0.254.  The largest connected component includes 96\% of
%all observed nodes: the 90th percentile of the distribution of shortest path
%distances between nodes is 5.43 hops.  There is evidence of the small-world effect: while the shortest path
%lengths tend to be only a few hops, the average clustering coefficient is high,
%        suggesting that strong local structures tend to be connected by
%        occasional shortcuts.
In Figure~\ref{fig:temporal_gowalla_degree_dist} we present again the
degree distribution, which exhibits a heavy tail.
In contrast, both the distributions of node age and link geographic distance, in
Figures~\ref{fig:temporal_gowalla_age_dist}-\ref{fig:temporal_gowalla_distance_dist},
  do not exhibit heavy tails but instead an almost exponential decay (notice the
      linear x-axis). There is a large fraction of short-range geographic
  connections: about 50\% of social links span less than 200 km, with only a
  small fraction being longer than 4,000 km. At the same time, the distribution
  of node age shows that nodes have joined the network with some irregular
  temporal spikes; overall, about 99\% of all nodes have joined the service in
  the last 300 days.

%\begin{figure}[ht]
%  \centering
%     \includegraphics[width=\plotscale]{images/densification.pdf}
%    \caption{Densification of the social network: the number of edges grows
%        superlinearly with the number of nodes.}
%    \label{fig:gowalla_densification}
%\end{figure}

\section{Modelling network growth}
\label{sec:mle}
In this section  we study how the creation of individual social links is
influenced by both global and local properties of the network, adopting a
methodology based on the Maximum Likelihood Principle to evaluate and compare
how a set of models describe the empirical data.

In more detail, we analyse two core facets of temporal network evolution:
\begin{itemize}
\item \textbf{how edges are created}: we test different attachment models that
select the target of a new connection given the social and spatial properties of
network nodes;
\item \textbf{how social triangles are created}:  we test
a family of triadic closure mechanisms based on node properties and spatial distance.
\end{itemize}

Our results demonstrate that node degree and spatial distance are
simultaneously influencing edge creation, suggesting that \textit{a
gravitational attachment model describes real network evolution better than
purely social or spatial models}; we also find that \textit{social factors
are more important than spatial constraints when an edge closes a
triangle}. These findings will be revisited and exploited in
Section~\ref{sec:gravitational_model}, where we will define a new model of network
growth.

\subsection{Maximum likelihood estimation}
\label{subsec:mle}
When assessing whether a model reproduces empirical properties observed in the
data, a common approach is to test if global properties are equally found in
the real data and in the output of the model.  Instead, we take advantage of
the fine-grained temporal information present in our traces and we adopt a
quantitative approach to compare how different models describe the empirical
traces.  We directly compute the likelihood that a model has of generating the
events observed in our sequence of traces. The Maximum Likelihood Principle can
then be applied: historically, this principle has been used to  
compare numerically a family of models and, as a result, pick the ``best'' model (and
parameters) to explain the data~\cite{Sti02:stats}.

Studying networks with likelihood methods requires a probabilistic model
describing the evolution of the graph itself. In other words, the network is
considered the result of an evolutionary stochastic process which has driven
its growth, both in terms of new nodes and new edges~\cite{WBH06:likelihood}. For instance, the
preferential attachment model discussed in Section~\ref{subsec:growth}
describes the evolution of a network in terms of the probability of connection
between new nodes and existing nodes. Given real data about the evolution of a network, one can
test the extent to which the assumptions of preferential attachment model
are supported by the data.

In our case, estimating the likelihood of a model \emt{M} involves considering
each individual edge \emt{e_t=(i,j)} created during our measurement period and
computing the likelihood \emt{P_M(e_t)} that the source \emt{i} selects the
actual destination \emt{j} according to the model \emt{M}. Thus, the likelihood
\emt{P_M(G)} that model \emt{M} reproduces graph \emt{G} is given by the product
of the individual likelihoods according to  model \emt{M}:
\begin{equation}
P_M(G) = \prod_t P_M(e_t) 
\end{equation}
We use log-likelihood for better numerical accuracy, obtaining
\begin{equation}
\label{eq:mle}
\log(P_M(G)) = \log(\prod_t P_M(e_t)) = \sum_t \log(P_M(e_t))
\end{equation}
Equation~\eqref{eq:mle} suggests a simple algorithm to compute the
log-likelihood of a given model \emt{M}: for each new edge  created during the
graph evolution, we compute the probability that it would be created according
to  model \emt{M}, we take the logarithm of this probability and we sum all the values
obtained for each edge. When this procedure is repeated for several models, we
can choose the model with the highest likelihood to explain the data. 

Since every edge is undirected and we do not have information about which user
initiated the social contact, we consider every new edge \emt{e_t=(i,j)} in
both directions in the rest of our analysis, in order to avoid any bias. This
methodology can be extended easily to handle directed graphs.
\begin{figure}[t]
  \centering
     \includegraphics[width=\plotscale]{images/gowalla_degree_prob.pdf}
    \caption{Probability of creating a new social link as a function of node
        degree.}
    \label{fig:degree_attach}
\end{figure}

\subsection{Modeling edge attachment}
\label{subsec:modeling_edge}
We investigate the effect of three global characteristics on the creation of
individual social links between users: node degree, node age and spatial
distance between nodes.


\paragraph{Attachment by node degree}
According to the preferential attachment model~\cite{BA99:ba},
the probability of creating a new connection to a node is proportional
to the number of its existing connections. The cumulative advantage
held by high-degree nodes results in a degree distribution with a heavy
tail, as some nodes accumulate a large number of connections. We test
whether a similar mechanism is compatible with our data by computing
the probability \emt{P_{deg}(k)} that a new link will be created with a node
with degree \emt{k}:
\begin{equation}
\label{eq:p_degree}
P_{deg}(k) = \frac{| \{ e_t : e_t=(i,j) \land k_j(t-1) = k \}|}{\sum_t n_k(t-1)}
\end{equation}

\noindent where the numerator counts how many edges have been created
connecting a node with degree \emt{k} and the normalisation factor considers
how many nodes with degree \emt{k} were present when the edge was created. If preferential
attachment is not governing the growth, then  \emt{P_{deg}(k)} should not depend on
\emt{k}. However, we see in Figure~\ref{fig:degree_attach} that \emt{P_{deg}(k)
\propto k^{0.74}}, showing that nodes with higher degree are more likely to
attract new edges than nodes with fewer connections. Although the trend is not
exactly linear as in the original preferential attachment model, node degree is
related to the creation of new edges.



\begin{figure}[t]
  \centering
     \includegraphics[width=\plotscale]{images/gowalla_age_prob.pdf}
    \caption{Probability of creating a new social link as a function of node
      age.}
    \label{fig:age_attach}
\end{figure}

\paragraph{Attachment by node age}
The period of time a node has been part of the service could also be a factor
affecting the creation of edges. Older nodes might have more visibility or
more authority in the network; at the same time, when new users join the network
they might experience intense activity as they search the network for potential
connections. 
We compute \emt{E(a)}, the number of edges created by nodes of age \emt{a}
normalised by dividing by the total number of nodes that ever achieved at least age \emt{a}:

\begin{equation}
\label{eq:p_age}
E(a) = \frac{| \{ e_t : e_t=(i,j) \land t(e)-t(i) = a\}|}{|\{n : T-t(n) \geq a \}|}
\end{equation}

\noindent where \emt{T} is the time when the last node joined the network during the
measurement period. As depicted in Figure~\ref{fig:age_attach}, there is a spike
at age 0: this represents nodes that join the network, create some links and then
never come back.  The number of created edges then quickly goes down with age
\emt{a} but grows again for higher values of \emt{a}. Many links are created
when a node joins the network, followed by lower levels of edge creation; older
nodes then tend to establish further links.



\paragraph{Attachment by spatial distance}
Another important factor for edge attachment may be geographic distance.
We compute the probability \emt{P_{geo}(d)} that a new edge spans geographic distance
\emt{d}, normalised by dividing by the number of nodes at distance \emt{d} from the source:

\begin{equation}
\label{eq:p_distance}
P_{geo}(d) = \frac{| \{ e_t : e_t=(i,j) \land D_{ij} = d\}|}{\sum_t |\{n:D_{in} = d \}|}
\end{equation}
\begin{figure}[t]
  \centering
     \includegraphics[width=\plotscale]{images/gowalla_distance_prob.pdf}
    \caption{Probability of creating a new social link as a function of the
        geographic distance of the node. 
          Logarithmic binning has been adopted to estimate the
    number of samples in each range of values.}
    \label{fig:geo_attach}
\end{figure}

Our data show how \emt{P_{geo}(d)} decreases with distance \emt{d}, as reported
in Figure~\ref{fig:geo_attach}, even though the trend appears noisy.  In
particular, the data roughly follow a trend \emt{P_{geo}(d) \approx d^{-\alpha}}
with \emt{\alpha=0.6}.  While a similar functional form has been found in other
spatial social networks, but with different exponents \emt{\alpha}, 
in this case it is measured at the level of individual edge creation events.
Geographic distance affects the edge creation process in a clear way: as already
discussed in Section~\ref{subsec:friendprob} in the static scenario, even when
considering individual edge creation events, longer links have a lower
probability of appearance than short-range ones.


\begin{figure}[t]
\centering
    \includegraphics[width=\plotscale]{images/geogap_prob_1.pdf}
    \caption{Probability Distribution Function of \emt{\lambda(1)}, the geographic
        span of the first social link created by a user. Logarithmic binning has been adopted to estimate the
    number of samples in each range of values.}
    \label{fig:geogap_prob}
\end{figure}


\subsubsection{Evidence of gravity effects in network growth}
%The effect of node degree on network evolution is well captured by the
%preferential attachment model, where the probability of connection between nodes
%\emt{i} and \emt{j}, \emt{P_{ij}}, is proportional to the degree of node \emt{j}, \emt{P_{ij}
%\propto d_i}.  
%This model generates networks with degree distribution exhibiting
%a heavy tail, as there are a few nodes, the so called ``hubs'', that accumulate
%an extremely high number of connections.  Real-world examples such as
%transportation and communication networks can be described by a preferential
%attachment model, but geographic distance is an important parameter as well. In
%fact, long-range connections tend to exist mainly between well-connected
%hubs~\cite{ES90:gravity, JWS08:korean, KCR09:gravity}.
%
%The effect of geographic distance can included in the attachment probability,
%    \emt{P_{ij} \propto d_i
%d_j f(D_{ij})}, where \emt{f} is a decreasing \textit{deterrence function} of the
%geographic distance \emt{D_{ij}} between the nodes. Thus, long
%distances tend to be covered only to connect to important hubs, while nodes
%with less connections become attractive when they can be reached over a
%short distance. 
%When the deterrence function has a simple functional form such as \emt{f(d) \sim
%d^{-\alpha}}, then the probability of a connection between two nodes becomes
%similar to the gravitational attraction between celestial bodies, \emt{P_{ij}
%\propto \frac{d_i d_j}{D_{ij}^{\alpha}}}. Hence, this family of  attachment models
%has been known as \textit{gravity models}~\cite{Car56:gravity}.
%We want now to understand whether there is any evidence that similar factors are shaping the
%evolution of our real spatial social network.

We discussed in Section~\ref{sec:disc3} how gravity models are a
suitable choice to combine social and spatial properties. Our aim is now to
uncover evidence to support the hypothesis that a similar mechanism can
reproduce patterns observed in the real data.

A consequence of the gravity model is that nodes with higher degree tend to
attract longer links. We therefore define \emt{\lambda_i(k)} as the geographic
length of the \emt{k}-th edge created by user \emt{i} and we study how the
probability distribution of \emt{\lambda(k)} changes for different values of
\emt{k}.  The probability distribution of \emt{\lambda(1)} is reported in
Figure~\ref{fig:geogap_prob}.  The distribution can be roughly divided into two
regions: social connections shorter than 5 km, a threshold compatible with the size of many
urban areas, exhibit a constant probability,
  whereas the probability of creating longer ties quickly decays. 
  
This preference for short-range ties confirms our previous analysis of 
edge attachment mechanisms.  The influence of degree \emt{k} on the geographic
properties of social links appears strong; as shown by 
Figure~\ref{fig:geogap_avg}, both the average and the median value of the
geographic length \emt{\langle \lambda(k)\rangle} of the \emt{k}-th edge
increase with \emt{k}. While the average length of the first edge is about
1,100 km, the 100th edge is about 2,400 km. The median value shifts
accordingly with \emt{k}, increasing from 150 km to more than 900 km for
higher degrees.  These findings are compatible with a gravity model where node
degree and geographic distance simultaneously influence social connections
created over space.

\begin{figure}[t]
\centering
    \includegraphics[width=.75\columnwidth]{images/geogap_avg.pdf}
    \caption{Average and median geographic span gap of the \emt{k}-th
        edge created by a node as a function of \emt{k}.}
    \label{fig:geogap_avg}
\end{figure}


\subsubsection{Evaluation of attachment models}
We have discovered that individual node properties and
geographic distance affect edge creation. Our aim is now to
understand what type of edge attachment models better approximate 
network temporal evolution.

We deliberately choose simple models, since 
our goal is not to reproduce exactly the temporal evolution
of the network but, rather, to understand which factors
mainly drive its growth. We consider four different edge attachment models, each
one with a single parameter \emt{\alpha}:
\begin{itemize}
\item[\textsf{D}:] the probability of creating an edge with node \emt{n} is proportional to a
power \emt{\alpha} of its degree: \emt{k_n(t)^{\alpha}}; 
\item[\textsf{A}:] the probability of
creating an edge with node \emt{n} is proportional to a power \emt{\alpha} of its age:
\emt{a_n(t)^{\alpha}};
\item[\textsf{G}:] the probability of creating an edge with node \emt{n} is inversely
proportional to a power \emt{\alpha} of its geographic distance from \emt{i}:
\emt{D_{in}^{-\alpha}}; 
\item[\textsf{DG}:] the probability of creating an edge with node \emt{n} is proportional to
its degree and inversely proportional to a power \emt{\alpha} of its geographic
distance from \emt{i}: \emt{k_n(t) D_{in}^{-\alpha}}
\end{itemize}

We will make use of the Maximum Likelihood Principle presented in
Section~\ref{subsec:mle} to compare and evaluate which model and which
parameters better reproduce the real evolution. 
%Given a model we unfold the
%temporal network evolution and, for each edge, we compute the probability that
%the edge would have been created according to the model, resulting at the end
%in the likelihood that a single model generates the entire evolution. This
%approach allows us to directly compare models in a consistent way, rather than
%considering whether they capture some global properties of the network: a
%higher likelihood corresponds to a model which better describes the real data.
%
Figure~\ref{fig:mle} displays the log-likelihood values obtained by each model
as a function of the parameter \emt{\alpha}. First, we note that the models
\textsf{G} and \textsf{DG}, which incorporate geographic distance, have higher
log-likelihood than the other two models \textsf{D} and \textsf{A}, with the
overall maximum log-likelihood achieved by \textsf{DG}. The maximum log-likelihood for
\textsf{DG} is achieved for \emt{\alpha \approx 0.6}, which is in agreement with the
results obtained measuring \emt{P_{geo}(d)}. Node age does not seem a key factor for
edge attachment, as the model \textsf{A} shows decreasing values of
log-likelihood for values of \emt{\alpha} between 0 and 2. Model
\textsf{D} reaches its highest
log-likelihood for \emt{\alpha
  = -0.8} and fails to outperform \textsf{G} and \textsf{DG}. 
%Indeed, we have tested models which also
%combine node age with geographic distance and node degree, but they do not
%exhibit significant improvements with respect to the models without node age.  
Hence, it
seems that \textit{the main driving factors in edge attachment are node degree and
geographic distance} and that a gravity model which combines them is the most
suitable option.

\begin{figure}[t]
\centering
    \includegraphics[width=\plotscale]{images/total_mle.pdf}
    \caption{Log-likelihood of each edge attachment model as a function of their
        parameter \emt{\alpha}. The gravity model \textsf{DG} outperforms all the
          others.}
    \label{fig:mle}
\end{figure}

\subsection{Modelling triadic closure}
\label{subsec:modeling_triangles}
The edge attachment models previously proposed  only take into account the
influence of global network properties on new edge creation. However, local
network properties can be equally or more important; for instance, new links
often tend to connect users who already share friends, creating social
triangles that are extremely common in social networks~\cite{LK07:prediction}.
Hence, in this section we study how new links result in new social
triangles and whether different triangle-closing models can reproduce the
patterns observed in the data.

\subsubsection{The importance of triangle-closing links}
Social connections tend to link together individuals who are already at close
social distance: the vast majority of new links tend to be between nodes that
already share at least one friend, thus only 2 hops away from each other, with
larger social distances exponentially less likely~\cite{LBKT08:microscopic}. We
observe a similar pattern in our data: Figure~\ref{fig:num_hops} shows that the
number of edges \emt{E_h} that connect nodes \emt{h} hops away exponentially decays
with \emt{h}. A few edges also connect nodes that were not in the same
connected component, as when a new node joins the network and creates its first
link.

\begin{figure}[t]
  \centering
  \subfigure[]{
    \label{fig:num_hops}
    \includegraphics[width=\doubleplotscale]{images/link_hops.pdf}}
  \subfigure[]{
    \label{fig:prob_hops}
    \includegraphics[width=\doubleplotscale]{images/link_probhops.pdf}}
    \caption{Number of new links \emt{E_h} created between nodes \emt{h} hops away
      (a) and probability \emt{P_h} that a new link connects nodes \emt{h} hops away
        (b). The single \emt{E_h}  value at \emt{h = 0} denotes the number of
        edges connecting nodes previously in separate disconnected components.}
    \label{fig:hops}
\end{figure}

A better understanding of this process can be achieved by considering not only
how many new links connect nodes \emt{h} hops away, but also considering the number
of nodes at that social distance. In fact, since \emt{E_h} exponentially
decreases with \emt{h} and the number of available nodes increases with \emt{h},
          the probability \emt{P_h} that a new link spans \emt{h} hops must be decreasing much
          faster than exponentially. More precisely, we compute \emt{P_h} as 

\begin{equation}
\label{eq:p_h}
P_h = \frac{| \{ e_t : e_t=(i,j) \land h_{t-1}(i,j) =h \}|}{\sum_t |\{n: h_{t-1}(i,n) = h \}|}
\end{equation}

\noindent where \emt{h_{t}(i,j)} is the number of hops between nodes \emt{i} and \emt{j} at
time \emt{t}. Figure~\ref{fig:prob_hops} plots \emt{P_h} as a function of \emt{h}: the
probability decays quickly  and
finally reaches a constant value. Thus, triadic closure seems to be the
predominant factor shaping network growth over time: new edges are most
likely to connect people who already share at least one friend, closing social
triangles. 



Given the importance of triadic closure, we focus on how this process is
affected by spatial properties. We want to understand whether there is any
interplay between the geographic distance and the social distance that a new
link spans. A first indication is given by the geometric average geographic distance
\emt{D_h} of all new edges that connect nodes previously \emt{h} hops away, shown in
Figure~\ref{fig:geo_hops}.
There is an evident trend: \textit{social connections at
shorter social distance tend to have higher geographic distances, while links
spanning more hops have lower spatial distance}. A potential explanation
is that both social and geographic distance tend to 
affect the edge creation process: a new link is created either between users
sharing friends, even if they are far from each other, or between spatially
close users, even if they have no friends in common. In
particular, it appears that \textit{geographic proximity is as important as social
closeness}: both factors are shaping the network, but in different ways.

\begin{figure}[t]
\centering
    \includegraphics[width=\plotscale]{images/hop_geo.pdf}
    \caption{Geometric average geographic distance \emt{D_h} of new links created between
      nodes \emt{h} hops away. The single \emt{D_h}  value at \emt{h = 0} denotes
        the average geographic distance of links connecting nodes previously in
        separate disconnected components.}
    \label{fig:geo_hops}
\end{figure}

In summary, our analysis of triadic closure confirms that two users
sharing at least one friend are much more likely to create direct connections
than two users without friends in common.  At the same time, geographic distance
appears again as a driving force, even if in a different way, influencing the
creation of edges between users without friends in common.

\begin{figure}[t]
  \centering
    \includegraphics[width=\plotscale]{images/triadic.pdf}
    \caption{ In a triangle-closing model node \emt{s} creates an edge by
selecting first an intermediate node \emt{i}, which then selects target node
\emt{t}: the edge \emt{(s,t)} is thus created. Different strategies to select a
node's neighbour can be combined. All candidate nodes two hops away are shaded in
grey; the baseline model picks at random among these candidates.}
    \label{fig:triadic}
\end{figure}


\subsubsection{Triangle-closing models}
Since about 60\% of new edges close social triangles, such triangle-closing
edges represent an important aspect of network growth. Hence, our aim is now to
understand what factors influence which node to choose when a edge is closing a
triangle. Again, we make use of the Maximum Likelihood Principle to test whether different triangle-closing models would be able to generate the
triangles created during  the real network evolution.

We consider the case when a source node \emt{s} chooses another target node
\emt{t} located two hops away to create a new link, as illustrated in
Figure~\ref{fig:triadic}. A simple model would be for node \emt{s} to
choose \emt{t} uniformly at random from all the nodes at a distance of 2 hops, which
will be our baseline model.  We then take into account more complex models where
node \emt{s} first chooses according to a given strategy an intermediate node \emt{i}
among its neighbours and then picks a target \emt{t} among \emt{i}'s neighbours with a
potentially different strategy. The edge \emt{(s,t)} is then created, closing
the triangle \emt{(s,i,t)}. 

Since every strategy involves only choosing a node \emt{i}
among the neighbours of a given node \emt{n}, we consider five different strategies to
choose \emt{i}:

\begin{enumerate}
\item\textsf{random}: uniformly at random among node \emt{n}'s neighbours;
\item\textsf{shared}: proportional to the number of shared friends between \emt{i}
and \emt{n};
\item\textsf{degree}: proportional to the degree of the neighbour \emt{i};
\item\textsf{distance}: inversely proportional to the geographic distance
between \emt{i} and  \emt{n};
\item\textsf{gravity}: proportional to the degree of the
neighbour \emt{i} and inversely proportional to the geographic distance between \emt{i} and \emt{n}.
\end{enumerate}


Since there are 5 different triangle-closing models, there is a total of 25
different combinations. We compute the log-likelihood of each combination and we measure the percentage
improvement over the log-likelihood of the baseline model.  The results are
presented in Table~\ref{tab:triangle_mle}: the general trend is that
\textsf{random} and \textsf{shared} offer the largest improvements over the
baseline, with a maximum improvement of 14.54\% in the combination
\textsf{shared}-\textsf{random} and 12.34\% for \textsf{random}-\textsf{random}.
Models based on degree or on distance have performance much lower than
the baseline, with degradation of up to  40\% when the \textsf{gravity} model is
adopted. In particular, the \textsf{random}-\textsf{random} model works
surprisingly well, as it favours connections between nodes that have multiple
2-hop paths between them and that have higher degrees, while being extremely
simple and computationally fast.
\begin{table}[t]
\centering
\begin{tabular}{|c||c|c|c|c|c|}
\hline
& \textsf{random} & \textsf{shared} & \textsf{degree} & \textsf{distance} & \textsf{gravity} \\
\hline
 \textsf{random} & \textbf{12.34} & 9.48 & -3.47 & -28.17 & -35.26 \\
 \textsf{shared} & \textbf{14.54} & 11.47 & -0.95 & -24.74 & -34.46 \\
 \textsf{degree} & 7.33 & 5.16 & -6.79 & -25.17 & -41.98 \\
 \textsf{distance} & -0.92 & -3.70 & -16.94 & -39.32 & -41.53 \\
 \textsf{gravity} & 2.71 & 0.25 & -12.11 & -33.01 & -43.18 \\
\hline
\end{tabular}
\caption{Performance of different triangle closing models:  rows show the model
  used to pick the intermediate node, and columns show the model used to then
    pick the target node.  The value in each cell gives the percentage
    improvement of model log-likelihood over the baseline model,  which chooses
    a random node two hops away from the source.}
\label{tab:triangle_mle}
\end{table}

These results show that \textit{triadic closure is mainly driven by social processes,
while geographic distance is not an important factor}. Nonetheless,
triangle-closing mechanisms only model some aspects of network evolution, as
non-local edges are still needed to globally connect the network
and create the small-world effect.  In summary, social processes at a local
level seem complementary to spatial factors that are shaping the network
at a global level. As we will see, we need both to model real
network evolution successfully.




\section{Temporal aspects of network growth}
\label{sec:temporal_evolution}
After considering the edge attachment and the triangle-closing processes, we now
shift our attention to the temporal properties of the network evolution. In this
section we study how users create new connections as they spend
more time on the network.  Our aim is to understand these temporal patterns in
order to capture them and reproduce them in a network growth model, rather than
discussing their statistical description. 

At first, we consider the amount of time users are
active on the network, their lifespan, and then we investigate whether nodes with
different degrees tend to establish new edges at a different pace and at
different geographic distances. We consider only nodes that have joined the
network after our measurement process started, in order to avoid any bias in the
estimation of their temporal properties: in this way we are observing their
entire temporal evolution from the very first moment they appear in the system.

\begin{figure}[t]
  \centering
    \includegraphics[width=\plotscale]{images/temporal_gowalla_span_dist.pdf}
    \caption{Complementary Cumulative Distribution Function (CCDF) of node
        lifespan and exponential fit.}
    \label{fig:lifespan}
\end{figure}

\subsection{Node lifespan}
We define the \textit{lifespan} of a node to be  the temporal difference between the
time the node created its last edge and the time when the node joined the
service.
The lifespan of a node is likely to affect the network evolution, since nodes
that cease to be active stop creating new edges, affecting the global
properties of the whole network.

Figure~\ref{fig:lifespan} shows the distribution of lifespan for all users:
the distribution shows approximately exponential behaviour, with a deviation
only at longer lifespans for few users  who were early adopters and started
using the service from the very first days. The fit is reasonably
good for a wide range of lifespan values and it can be used to capture and
reproduce how long nodes stay active in the network.


\subsection{Inter-edge temporal gap}
Different users can show significant differences in the pace at which they add new edges:
some users can be faster and more active than others. In addition, 
users who have been active for a longer period on the service may attract new friends at a faster
rate. We define \emt{\delta_i(k)} as the temporal gap between
the \emt{k}-th and \emt{k+1}-th edges of user \emt{i} and we study the distribution of
\emt{\delta(k)} across all users for different values of \emt{k}. 
\begin{figure}[t]
  \centering
     \includegraphics[width=\plotscale]{images/gap_prob_1.pdf}
    \caption{Probability Mass Function (PMF)  of \emt{\delta(1)}, the
        temporal gap elapsing between the times when the first and the second
            edges are created by a user. 
%            The fits show a power law, an
%            exponentially truncated power law and a shifted exponential.
            The fits are a truncated power law
            \emt{\delta(1)^{-\alpha}
        exp(-\delta(1)/\beta)}, a shifted exponential \emt{\delta(1)^{\beta-1}
        exp(-\lambda \delta(1)^\beta)} and a power law \emt{\delta(1)^{-\alpha}}.
    }
\label{fig:gap_prob}
\end{figure}


Figure~\ref{fig:gap_prob} displays the probability distribution of
\emt{\delta(1)},
the amount of time between the first and the second edges created by a user.
  Even though many users add their second edge after a few days, some users wait
  several weeks.  Hence, there is a wide range of variability in how quickly
  nodes start adding new edges after they join the network. The distribution
  can be captured by different functional forms: we find that an exponentially
  truncated power law \emt{p(\delta(1)) \propto \delta(1)^{-\alpha_1}
  exp(-\delta(1)/\beta_1)} yields a slightly higher log-likelihood than a pure
  power-law, a shifted exponential and an exponential, even though the average
  log-likelihood improvement over the exponential fit is below 5\%. For our
  modelling
  purposes we will use the exponentially truncated power law, which provides the highest
  likelihood even for different values of \emt{k}.

Then, we study the effect of the current degree \emt{k} on the temporal behaviour 
of the user: in particular, we are interested in seeing whether the probability
distribution of \emt{\delta(k)} changes with \emt{k}.  A first indication is given in
Figure~\ref{fig:gap_avg}, which plots the average temporal gap \emt{\langle
\delta(k) \rangle} between the \emt{k}-th and \emt{k+1}-th edge for different values of
\emt{k}: users with higher degrees tend to wait, on average, for a shorter amount of
time before adding a new edge. In fact, users wait on average 20 days before
adding their second edge but only 7 days when they have about 100 friends.


It is not surprising that nodes with higher degree add links at a faster pace:
given a fixed temporal period, as in our measurement, higher degree nodes add
more links than lower degree ones, so their activity has to be greater in the
same temporal period. Nonetheless, we are still interested in capturing this
heterogeneous temporal behaviour, as this fosters heterogeneity in the degree
distribution as well~\cite{LBKT08:microscopic}.

%Another interpretation about this result is that users that have many
%connections at the end of the measurement period ought to have a higher
%frequency of edge creation, in order to obtain higher degrees. More than in
%causality, we are interested in the fact that this temporal pattern correlates
%with some users accumulating many more connections than other. This
%heterogeneity is crucial in real online social networks, hence our aim will be to
%reproduce it.


We find that the same truncated power law \emt{p(\delta(k)) \propto
\delta(k)^{-\alpha_k} exp(-\delta(k)/\beta_k)} holds for different values of
\emt{k}, always offering the fit with the highest log-likelihood.  While
we find that \emt{\alpha_k}
tends to be unrelated to \emt{k}, the exponential cut-off \emt{\beta_k} becomes smaller as \emt{k}
grows larger, as seen in Figure~\ref{fig:gap_beta}.  The
effect of this trend is that nodes with higher degrees are less likely to wait
for a longer time span, as the truncated tail of the power law
\emt{P(\delta(k))} increasingly constrains higher gap values.

\begin{figure*}[t]
  \centering
  \subfigure[]{
    \label{fig:gap_avg}
     \includegraphics[width=\doubleplotscale]{images/gap_avg.pdf}}
  \subfigure[]{
    \label{fig:gap_beta}
     \includegraphics[width=\doubleplotscale]{images/gap_beta.pdf}}
    \caption{Arithmetic average value \emt{\langle \delta(k) \rangle} (a) and
      exponential cut-off \emt{\beta_k} (b) of the
      truncated power law \emt{p(\delta(k)) \propto \delta(k)^{-\alpha_k}
        exp(-\delta(k)/\beta_k)}, which approximates the probability distribution
          of the  temporal gap \emt{ \delta(k) } between the \emt{k}-th
        and \emt{k+1}-th edges as a function of node degree \emt{k}.}
\end{figure*}

%\begin{figure*}[t]
%  \centering
%  \subfigure[]{
%    \label{fig:gap_alpha}
%     \includegraphics[width=\doubleplotscale]{images/gap_alpha.pdf}}
%  \subfigure[]{
%    \label{fig:gap_beta}
%     \includegraphics[width=\doubleplotscale]{images/gap_beta.pdf}}
%    \caption{Parameters \emt{\alpha_k} (a) and \emt{\beta_k} (b) in the
%      truncated power law \emt{p(\delta(k)) \propto \delta(k)^{-\alpha_k}
%        exp(-\delta(k)/\beta_k)}, which approximates the probability distribution
%          of the  temporal gap \emt{ \delta(k) } between the \emt{k}-th
%        and \emt{k+1}-th edge as a function of node degree \emt{k}.}
%\end{figure*}


\section{A new spatial model of network growth}
\label{sec:gravitational_model}
In this section we build upon the set of results found in this chapter and we
propose a new model of network growth with spatial information.
Our model combines a \textit{gravitational attachment} process with a
triangle-closing model to algorithmically grow a spatial network
edge-by-edge. We demonstrate that the resulting synthetic network exhibit
social and spatial properties of the true network, whereas a similar model
without the effect of geographic distance does not. 

We have seen that a gravity-based mechanism describes how
new edges are created (in Section~\ref{subsec:modeling_edge}), while we have
discussed how triadic closure is mainly shaped by social factors rather than
geographic ones (in Section~\ref{subsec:modeling_triangles}). These two mechanisms seem to
be complementary; while the former is responsible for edges connecting together
different parts of the network, the latter seems involved in the creation of
local edges between nodes that already share a friend. We 
analysed how nodes tend to create new edges faster and faster as they acquire more
connections in Section~\ref{sec:temporal_evolution}. Building on all these results our
aim is now to define a network growth model that is able to capture and
reproduce the spatial and social properties of the real network.


\subsection{A new gravitational attachment model}
Following the methodology presented in~\cite{LBKT08:microscopic},  we
describe our model as an algorithm to grow a network one node, and one
edge, at a time:
\begin{enumerate}
\item A new node \emt{u} joins the network 
%according to a certain arrival discipline and
and positions itself in  physical space.
\item Node \emt{u} samples its lifetime from an exponential distribution;
\item Node \emt{u} adds its first edge to some node \emt{v} according to a gravity model, with
probability directly proportional to the degree of \emt{v}  and inversely
proportional to their geographic distance;
%\item Node \emt{u} with degree \emt{k} samples a time gap \emt{\delta} from a distribution
%\emt{p(\delta(k)) \propto \delta(k)^{-\alpha_k} exp(-\delta(k)/\beta_k)}  and then
%goes to sleep for \emt{\delta} time steps;
\item Node \emt{u} with degree \emt{k} samples a time gap \emt{\delta} from a
truncated power-law probability distribution, dependent on degree \emt{k},
and then goes to sleep for \emt{\delta} time steps;
\item When node \emt{u} wakes up, if its lifetime has not expired yet it creates a
two-hop new edge using the \emt{\textsf{random}-\textsf{random}} triangle-closing
model and then repeats step 4.
\end{enumerate}

The different processes included in this model are meant to reproduce the most
important properties observed in spatial social networks. The combination of
exponential node lifetimes and degree-dependent inter-edge waiting times allows
few nodes to accumulate a higher number of connections, resulting in a
heavy-tailed degree distribution. The gravitational attachment favours
geographically shorter
links and introduces correlations between the number of connections and the
spatial properties of these connections, replicating the heterogeneity observed
across users.  Finally, the purely social triangle-closing mechanism mimics how
social triads are not affected by spatial constraints.

\begin{figure*}[t]
  \centering
  \subfigure[]{
    \label{fig:model_degree}
    \includegraphics[width=\doubleplotscale]{images/model_degree.pdf}}
  \subfigure[]{
    \label{fig:model_distance}
    \includegraphics[width=\doubleplotscale]{images/model_distance.pdf}}
    \caption{Comparison of our gravity-based model (GM), preferential attachment
    model (PA) and real network: Probability Mass Function
        (PMF) of node degree (a) and  Cumulative Distribution Function (CDF) of
        link geographic length (b). }
\end{figure*}

%The model can be adapted to different scenario by changing how new nodes join
%the network, i.e., uniformly over time or with increasing rate, how the lifetime
%is sampled, where the nodes are situated, how the parameters \emt{\alpha_k} and
%}\beta_k} change with \emt{k} and so on.

\subsection{Evaluation}
In order to test our model we take the Gowalla network at the beginning of our
measurement period, denoted with \emt{T_1} and we simulate its growth by adding
the missing nodes, with their real geographic locations, according to when they
joined the real network. However, once they join the network they add new edges
according to our algorithmic model.
%Thus, each new node samples its lifetime, connects to an
%existing node with the gravity model, goes to sleep and repeatedly closes
%triangles when it wakes up.
We stop the evolution of the network when it reaches the same number of edges
as the real graph at the end of the measurement period, or when all node
lifetimes have expired.

To assess better the performance of our model, we compare it to a similar model
where in Step 3 a node creates an edge according to the preferential attachment
model, thus ignoring geographic distance and considering only node degree in
the attachment probability.  We refer to our new gravitational attachment model as
GM and to the preferential attachment model, which ignores spatial properties,
   as PA. Both models include the same triadic closure and inter-edge time gaps.
The two models are run 100 times with different random seeds and then their
properties are averaged over all these realisations.   We compute and compare
the properties of the networks by only considering edges added after \emt{T_1},
both in the real network and in the simulated models, to avoid the
properties of the initial graph \emt{G_{T_1}} influencing the final result. 
   
In our comparison we consider four different characteristics. First, we compute the
degree distribution and the probability distribution of geographic link length,
in order to assess whether these basic social and spatial properties are
correctly reproduced.  The degree distributions observed in the real network and
in the two models are
shown in Figure~\ref{fig:model_degree}: both models are able to reproduce the
distribution, replicating the social properties of the real network. As
shown in  Figure~\ref{fig:model_distance}, the probability distribution of link
geographic length is better approximated by the GM model, while the PA model
results in social links with longer geographic length. As
expected, the PA model fails to create those short-range links found in the real
network.

%\begin{figure*}[t]
%  \centering
%    \includegraphics[width=\plotscale]{images/model_correlation.pdf}
%    \caption{Comparison of our gravity-based model (GM), preferential attachment
%    model (PA) and real network: average link geographic length for all
%        nodes with a given degree as a function of node degree. }
%    \label{fig:model_correlation}
%\end{figure*}


Then, we focus on users and on their heterogeneity
using two measures already described in
Section~\ref{sec:socio_spatial}, briefly summarised here.
For every user we compute the \textit{friend distance}
and we plot the geometric mean
value of this measure for all users with \emt{k} connections, as a function of
\emt{k}. Similarly, for every user we compute the \textit{geographic triangle
  length}  and then 
%  by considering the average length of the three links for every triangle the user
%belongs to, and then taking the average of these values across all the triangles
%the users belong to. 
we plot the geometric mean  value of this measure for
all users with \emt{k} connections, as a function of \emt{k}. We adopt the
geometric mean to combine the values of users with the same degree
because such values span several orders of magnitude and we aim to emphasise
smaller values that correspond to short-range distance. These two user measures
will shed light on whether the two models capture the correlations between user
degree and socio-spatial properties that we discussed in
Chapter~\ref{ch:structure}. 

These two measures highlight a large difference between the
GM and PA models. When considering the average friend distance of a user as a function of
the user degree, as seen in Figure~\ref{fig:model_correlation},
both the GM and the PA models show an increasing trend, as the original
data. However, the PM model results in values
systematically higher than the GM model.
Similarly, when considering the average geographic triangle length in
Figure~\ref{fig:model_triangle_correlation}, we find that the GM model
reproduces the increasing trend observed in the real graph more closely than the
PA model, which fails to capture how users with fewer than 10 connections exhibit
low values of geographic triangle length.
\begin{figure*}[t]
  \centering
  \subfigure[]{
    \label{fig:model_correlation}
    \includegraphics[width=\doubleplotscale]{images/model_correlation.pdf}}
  \subfigure[]{
    \label{fig:model_triangle_correlation}
    \includegraphics[width=\doubleplotscale]{images/model_triang_length_corr.pdf}}
    \caption{Comparison of our gravity-based model (GM), preferential attachment
    model (PA) and real network: average friend distance for all
        nodes with a given degree (a) and
      average geographic triangle length for all
        nodes with a given degree (b), both as a function of node degree. }
\end{figure*}

%\begin{figure*}[t]
%  \centering
%    \includegraphics[width=\plotscale]{images/model_triang_length_corr.pdf}
%    \caption{Comparison of our gravity-based model (GM), preferential attachment
%    model (PA) and real network: average triangle geographic length for all
%        nodes with a given degree as a function of node degree. }
%    \label{fig:model_triangle_correlation}
%\end{figure*}

\section{Discussion and implications}
\label{sec:disc4}
Our results show that the effect of geographic distance cannot be
neglected when online social networks are studied and modelled: preferential attachment
mechanisms need to be modified into gravity-based mechanisms, which are able to
correctly balance the effects of node attractiveness and the connection costs
imposed by spatial distance. This finding gives rise to interesting implications
about processes that may be driving the actions of individuals.

In reality, preferential attachment and triadic closure together are already able to
reproduce the global social properties observed in real social networks, namely
the degree distribution and the level of clustering. However, neglecting
spatial information about where users are located fails to account for the
effect of distance: while in real systems users preferentially connect to other
users at closer distances, resulting in a considerable fraction of short-range
social ties, models without spatial information give rise to an unlikely  majority 
of long-range connections. This is at odds with plentiful empirical evidence,
both in offline and online social systems.

Our findings support the idea that distance has a simple effect on the creation
of social ties: the probability of connection between two individuals decreases
as a negative power of the spatial distance between them. Yet, this effect must
be combined with a process based on ``popularity'' or ``visibility'' that
introduces heterogeneity across users, such as attachment to the best connected
nodes, in order to fully recreate the self-reinforcing mechanisms that lead to
the scale-free degree distributions observed in social graphs.

Gravity models, already widely and successfully used to understand several types
of spatial systems, provide an elegant and insightful way of combining the
effect of distance and the influence of popularity. The main implication of the
gravity mechanism is that 
one user may be interested in another because the other user
is hugely popular, regardless of their spatial distance, or because the
other user is spatially close, \textit{regardless of popularity and importance}.
The underlying message, which goes beyond the specific scenario of network
growth, is that the powerful First Law of Geography is still at work: near
things are more related than distant things, as stated by Tobler in
1970~\cite{Tob70:geography}.

Surprisingly, the influence of distance on the formation of local network
structure appears negligible. Triadic closure does not seem affected by
geographic proximity between individuals; this suggests that network transitivity
is chiefly ruled by social factors that seem blind to geographic constraints.
The overall picture is that proximity both over space and over the social fabric
greatly fosters the creation of new social links; the result is that the
likelihood of a new connection increases when two individuals share many other
connections or when two individuals are close to each other. We also point out
that no friend recommendation mechanism was in place on the online service under analysis
during the measurement period.

This dual r\^ole of distance in the social and in the spatial dimension has
promising applications in a wide range of systems. In particular, while the
predictive power of social proximity has already been harnessed by a plethora
of friend recommendation systems, spatial closeness has largely been ignored in
such scenarios.  However, geographic proximity is a powerful but simplistic
indicator of social ties that are likely to be created. In fact, our model is
able to reproduce the global social and spatial properties observed in the real
traces, but it could be unable to accurately detect whether two users are going
to connect to each other. This lack of precision is due to a lack of
information: the spatial distance between the locations of two users is only
one variable. In reality, there is more information
about user spatial movements, since we can tap into a rich data source: the
places that users visit. Thus, geographic distance can be used together with
data about user check-ins to provide a more comprehensive picture of user
behaviour, likely to offer higher accuracy in modelling how new social ties are
established.  In Chapter~\ref{ch:prediction} we will discuss how friend
prediction systems can be designed based on this consideration.

%\todo{continue with more implications, talk about the combined effect of social
%and geo, implications for prediction with pointer to chapter 5}

\section{Related work}
\label{sec:related4}
The temporal patterns of network evolution have been the focus of many studies
and several models been put forward to describe the basic mechanisms that may
drive network growth.

% Initial temporal models (BA and variations)
One of the first models of temporal network growth was the simple
yet powerful Barab\'{a}si-Albert (BA) model~\cite{BA99:ba}, based on two key ingredients:
growth and preferential attachment. Inspired by how new Web pages tend to link to
already popular pages, this model reproduces the scale-free degree distribution
observed in several network systems across different scenarios. The rationale
behind the BA model is to focus on the network as a dynamic
entity under continuous evolution: hence, by mimicking the dynamic mechanisms
that assemble the network over time, one will be able to reproduce the
topological properties of the system at the current time.

The properties of the networks generated by the BA model have been extensively
studied and discussed; in particular, we note that BA graphs tend to show a
vanishing level of clustering as the system grows in size,  and also tend to
exhibit negative degree-degree correlations. Hence, the BA model has attracted a considerable
amount of attention in the literature from authors who have tried to modify its
basic mechanisms to introduce such characteristics that are often found in other
networks, such as in social graphs.  


% Temporal properties and models for social networks
A set of works studied the temporal evolution of online social networks,
discussing global properties such as densification and diameter reduction
observed during the growth of the graph~\cite{FMN03:evolution,
KNT06:structure}.  Even though online social graphs
tend to have an heterogeneous degree distribution, in agreement with the
preferential attachment principle, these findings highlighted that, in social
networks, different mechanisms seem to be in place.  Leskovec
et al.~\cite{LKF06:densification} propose a ``forest-fire'' copying process:
when a new node joins the network and connects to a first neighbour, it
starts recursively creating new links among the neighbours of the first
neighbour, effectively copying the connections already in place. This process
mixes preferential attachment, as more connected nodes are more likely to be
selected, and transitivity, which fosters new connections between nodes in
social proximity. 
This confirms that triadic closure is an essential ingredient in the evolution
of social graphs, to generate transitivity and community structure.
    
Several other works have focussed on the
importance of triadic closure for social network evolution: Simmel 
noted that people sharing many friends might be more likely to become
connected~\cite{Sim08:triads}. This effect was then measured in real social
networks~\cite{LK07:prediction, KH06:simmel}  and included in growth models.
With respect to these results, our work explores, for the first time, the effect
of spatial distance on network evolution. Specifically, we study how distance
influences growth mechanisms such as preferential attachment and triadic
closure.

Another large body of work has focussed on general models for spatial
networks. One of the earliest examples is the Waxman model, where nodes are
distributed at random over space and then connected with probability
exponentially decreasing with distance~\cite{Wax88:model}. The Waxman model has
also been modified as a growth model, where new nodes join the network and
connect using a similar rule~\cite{KH04:model}. Barth\'{e}lemy proposed
to combine the preferential attachment rule with spatial distance, studying how the
resulting graphs move away from being scale-free as the effect of spatial
distance is increased~\cite{Bar03:crossover}, however this case only considered
an exponential decay of the effect of distance as in the original Waxman model.
Barrat et al.~\cite{BBV05:weighted} also considered a similar model for
weighted networks where preferential attachment is driven by the weight of the
existing  connections and hampered by greater spatial distance.

While these works contain the initial ideas related to modifying preferential
attachment with spatial influence, they were based on spatial systems  such as
transportation networks that lacked social properties. Hence, they tend to
focus on an exponential decay of the probability of connection as a function of
distance, different to what is observed in social graphs, and they ignore
properties arising from triadic closure. Our contribution builds on these
findings and brings together several different insights in order to obtain a
suitable model for spatial social graphs.

%Another set of works have investigated the spatial properties of social
%networks: the influence of geographic distance on social connections was firstly
%discussed in the LiveJournal community~\cite{Liben2005}. This influence appears so
%important that it can be exploited to predict where people live given their
%friends' locations~\cite{Backstrom_2010}. Other studies on
%mobile phone communication networks have found that
%social triangles tend to extend over large geographic
%distances~\cite{Lambiotte_2008} and that community detection
%approaches should take spatial distance into account to achieve better
%results~\cite{Expert_PNAS2011}.   The fostering effect of
%geographic proximity on social ties has been demonstrated considering both
%purely spatial coincidences~\cite{Crandall_2010} and repeated visits to
%venues~\cite{Scellato2011_kdd}. Our work extends these results by analysing the
%effect of spatial distance not on the static structure of social networks but on
%their temporal evolution.

Finally, we adopt the maximum likelihood methodology, and we base our growth
model on results presented in~\cite{LBKT08:microscopic}, where the evolution of
four different online social networks was discussed.  Again, our work differs as
it addresses the effect of geographic distance on the temporal mechanisms that
govern network evolution, providing a more complete understanding of the factors
driving social behaviour. Furthermore, we describe a model of network growth
that successfully reproduces both social and spatial properties observed in
real social graphs.

\section{Summary}
\label{sec:concl4}
In Chapter~\ref{ch:structure} we put forward the idea that spatial distance
still impacts online social services.  Users tend to exhibit heterogeneous
socio-spatial properties correlated with their number of friends: as they have
more and more connections, these connections tend to span longer geographic
distances. We suggested that, as found in many other
spatial networks, preferential attachment tends to be mitigated by mechanisms
akin to gravitational attraction: nodes still tend to connect to high-degree
hubs even when these are far away, but less connected nodes can still attract
short-range connections from nodes nearby. 

%Our discussion hinted that models where attraction between entities is
%influenced by their spatial distance have been described more than 50 years ago
%as \textit{gravity models}, where the combined effects of the
%importance of a node and the deterrent effect of distance 
%
%have been successfully
%used to describe systems such as transportation networks and highways between
%cities and mobile calls between cities~\cite{Erlander90, Jung08, Krings09}.
%Yet, there has been no investigation on the temporal evolution of a geo-social
%network as it acquires more nodes and more edges: \textit{existing growth models
%  for social networks do not consider the effect of geographic distance between
%    individuals}.

In this chapter we have extended our results concerning the structural properties of the
spatial social graph by investigating the effect of geographic distance on the
temporal evolution of a social network. 
%Adopting a maximum likelihood approach
%to test different models on real data, we have found that combining together node degree and
%spatial distance can reproduce how users create new online social
%connections, with a functional relationship which behaves as gravitational
%attraction. Furthermore, we have studied several triangle closing models,
%  discovering that social factors are more important than spatial ones when two
%  users already have friends in common. Finally, we have explored the temporal
%  properties of network evolution, studying node lifespan and inter-edge
%  temporal intervals.
%Based on our findings, we have defined and tested a gravitational attachment
%growth model reproduces the structural properties observed in the real spatial
%social network.  Our model highlights basic factors driving network evolution
%which could greatly impact a vast range of research efforts and practical
%applications devoted to spatial social networks. 
%We have found that combining together node degree and
%spatial distance can reproduce how users create new online social
%connections, with a functional relationship which behaves as gravitational
%attraction. Furthermore, we have studied several triangle closing models,
%  discovering that social factors are more important than spatial ones when two
%  users already have friends in common. Finally, we have explored the temporal
%  properties of network evolution, studying node lifespan and inter-edge
%  temporal intervals.
Based on our findings, we have defined and tested a gravitational attachment
growth model that reproduces the structural properties observed in the real spatial
social network.  This new model highlights basic factors driving network evolution
which could greatly impact a vast range of research efforts and practical
applications devoted to spatial social networks. 

Our model relies on triangle-closing mechanisms to create new edges and grow
the network. However, this fails to reproduce how new links can be created even
between users that do not share any friend, despite these links being a
minority of all new connections. In addition, our model only considers spatial
distance between users when modelling their behaviour, but in reality users
might exhibit more complex patterns that require more detailed data about where
they are located and where they go.  To overcome these limitations, we
need to exploit a different source of information to connect users who might
not share any friend, but who may geographically close. In the
next chapter we will investigate how the physical places that people visit can
not only bridge this gap, but also help to obtain precise and accurate predictions
about future online social connections.
